1. Introduction

Description of the Project

This project uses the World Bank’s Education Statistics dataset to explore long-term trends and relationships among key indicators of the US education system. Our goal is to understand how structural aspects of education (access, staffing, and outcomes) have evolved over time and how changes in one variable may influence changes in another. We analyze variables across different categories and examine differences between sexes. By focusing on longitudinal, cross-national data, we aim to identify potential patterns and meaningful interactions that could inform educational policy and investment.

The analysis of various indicators allows us to examine temporal changes and potential external factors that might shape educational outcomes.
We selected two major indicator categories that reflect different dimensions of the education system:

  1. Enrollment Rate vs. Attendance Rate
    We will compare enrollment and attendance rates across different levels of schooling (primary, secondary, tertiary) to test whether higher enrollment leads to consistent attendance over time.

  2. Number of Teachers vs. Teaching Staff Compensation
    We will investigate how changes in staffing levels correspond with compensation across educational stages.

Motivation

When we were brainstorming datasets to go with, we had difficulty choosing a topic because the possibilities were endless. We looked at several options from public health to climate to economics, but we eventually narrowed it down to education, which was something that all of us are connected to and that our classmates and readers could easily relate to. We initially considered focusing on Emory-specific data but broadened our scope to the World Bank’s Education Statistics for its larger context and long-term coverage.

As students at an American university, we were especially interested in examining data trends within the United States. Education policy and access here have evolved significantly over time, and exploring long-term patterns in enrollment, attendance, and staffing can reveal whether increased investment in education translates into better participation and equity. This directly relates to us, as college students who have undergone the three main levels of education and were told that our own efforts, most notably in high school, would determine our outcomes later in life. But how much do individual outcomes actually depend on the student, and how much on the school systems and structures that support them?


2. Dataset


3. Setup & Packages

We will perform Exploratory Data Analysis (EDA) in R using the tidyverse suite of libraries.
Our analysis will include the following components:

Before proceeding, we load all necessary packages and set global options for reproducibility and consistent output formatting.

# install.packages("tidyverse")
# install.packages("readr")
# install.packages("knitr")
# install.packages("plotly")
library(tidyverse) # core data manipulation & visualization
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr) # efficient CSV import
library(knitr)
library(plotly) # interactive data visualization
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(janitor) # column name cleaning & simple tabulations
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(lubridate) # data parsing and time handling

set.seed(42)
opts_chunk$set(echo=TRUE, message=FALSE, warning=FALSE)
message("All packages loaded.")
## All packages loaded.

4. Data Import & Cleaning

We start from reading the raw EdStats dataset (EdStats_v01.csv) and filter only the United States(Country code == "USA"). We write the filtered dataset to EdStats_USA.csv so later chunks only work with US data.

We also drop fully empty columns (2024 column) and write each subset to a CSV for reuse in later chunks.

This simplifies the dataset to the country of interest and removes all-NA columns to keep downstream transformations clean.

#set up data for use
data <- read_csv("EdStats_v01.csv")

#clean up data to have USA data
data_clean <- data %>%
  filter(grepl("USA", `Country code`)) %>%
  select(-`2024`) #all na
#make sure no NA cols remain
colnames(data_clean)[colSums(is.na(data_clean)) == nrow(data_clean)]
## character(0)
write_csv(data_clean, "EdStats_USA.csv")

To keep downstream steps tidy and reproducible, we create three thematic subsets aligned to our research questions:

  1. Enrollment vs. Attendance — indicators with “total net enrolment rate” or “total net attendance rate” (exclude parity indices and sex-specific series for apples-to-apples totals).
    • Note that “enrolment” is misspelled, this is corrected later in code.
    • Dropped all-NA columns and saved as EdStats_attend.csv.
  2. Teachers vs. Compensation — teacher counts and teaching staff compensation (exclude sex splits and qualification percentages).
    • Dropped all-NA columns and saved as EdStats_teacher.csv.

This ensures we work with clean, directly comparable variables.

# 1. Enrollment vs. Attendance
# filter for enrollment/attendance info
data <- read_csv("EdStats_USA.csv")
data_filter_attend <- data %>%
  filter(grepl("total net enrolment rate|total net attendance rate", `Indicator name`, ignore.case = TRUE)) %>%
  filter(!grepl("female|male", `Indicator name`)) %>%
  filter(!grepl("adjusted gender parity index", `Indicator name`, ignore.case = TRUE)) %>%
  select(where(~!all(is.na(.)))) %>% view() %>%
  write_csv("EdStats_attend.csv")

# 2. Teachers vs. Compensation
#filter for num teachers and teaching staff compensation info
data <- read_csv("EdStats_USA.csv")
data_filter_teacher <- data %>%
  filter(grepl("teachers in|teaching staff compensation", `Indicator name`, ignore.case = TRUE)) %>%
  filter(!grepl("female|male", `Indicator name`)) %>%
  filter(!grepl("percentage of qualified", `Indicator name`, ignore.case = TRUE)) %>%
  select(where(~!all(is.na(.)))) %>% view() %>%
  write_csv("EdStats_teacher.csv")

5. Attendance vs. Enrollment — Reshape, Compare, and Quantify

The following chunk expands our analysis of EdStats_attend.csv to quantify and visualize the relationship between school enrollment and attendance across education levels in the United States.
After reshaping the dataset into tidy long form, we standardize indicator names to capture both spellings of “enrolment/enrollment” and collapse duplicate series to obtain a single observation for each year × level × type combination.
From these, we compute a new variable representing the attendance gap (Enrollment – Attendance) to measure how much access (enrollment) translates into sustained participation (attendance).

Reading tip:
Focus on whether attendance consistently trails enrollment and whether that gap changes across levels or decades. High correlation indicates structural progress—both metrics rising together—while a persistent positive gap suggests that access alone does not guarantee participation.

data_attend <- readr::read_csv("EdStats_attend.csv", show_col_types = FALSE)

# 1) Long format with robust labeling (catch both 'enrolment' and 'enrollment' spellings)
data_long <- data_attend %>%
  pivot_longer(cols = matches("^\\d{4}$"), names_to = "Year", values_to = "Value") %>%
  mutate(
    Year = as.numeric(Year),
    # Standardize type labels
    Type = case_when(
      str_detect(`Indicator name`, regex("attendance", ignore_case = TRUE)) ~ "Attendance",
      str_detect(`Indicator name`, regex("enrol?ment", ignore_case = TRUE)) ~ "Enrollment",  # enrolment or enrollment
      TRUE ~ NA_character_
    ),
    # Standardize level labels
    Level = case_when(
      str_detect(`Indicator name`, regex("primary", ignore_case = TRUE)) ~ "Primary",
      str_detect(`Indicator name`, regex("lower\\s*secondary", ignore_case = TRUE)) ~ "Middle",
      str_detect(`Indicator name`, regex("upper\\s*secondary", ignore_case = TRUE)) ~ "High",
      TRUE ~ NA_character_
    )
  ) %>%
  filter(!is.na(Type), !is.na(Level), !is.na(Value))

# 2) Collapse duplicates within (Year, Level, Type)
#    (If the catalog has multiple series for the same concept, average them.)
collapsed <- data_long %>%
  group_by(Level, Year, Type) %>%
  summarize(Value = mean(Value, na.rm = TRUE), .groups = "drop")

# 3) Wide format with both columns side-by-side
gap_df <- collapsed %>%
  pivot_wider(names_from = Type, values_from = Value) %>%
  mutate(Gap = Enrollment - Attendance)

# Quick diagnostics (see what exists)
print(table(gap_df$Level, useNA = "ifany"))
## 
##    High  Middle Primary 
##      14      18      22
print(summary(gap_df[, c("Enrollment", "Attendance")]))
##    Enrollment      Attendance   
##  Min.   :85.90   Min.   :94.64  
##  1st Qu.:96.25   1st Qu.:96.55  
##  Median :97.77   Median :97.22  
##  Mean   :97.07   Mean   :97.10  
##  3rd Qu.:99.17   3rd Qu.:98.00  
##  Max.   :99.97   Max.   :98.32  
##  NA's   :3       NA's   :36
# If there are no overlapping rows, bail out gracefully
if (nrow(drop_na(gap_df, Enrollment, Attendance)) < 2) {
  warning("No overlapping Enrollment & Attendance observations after cleaning; skipping correlation and scatter.")
} else {
  # Correlation and scatter only when data exist
  corr_df <- gap_df %>% drop_na(Enrollment, Attendance)

  r_val <- cor(corr_df$Enrollment, corr_df$Attendance, use = "pairwise.complete.obs", method = "pearson")
  message(sprintf("Correlation between Enrollment and Attendance: r = %.2f", r_val))

  p_scatter <- ggplot(corr_df, aes(Enrollment, Attendance, color = Level)) +
    geom_point(alpha = 0.6) +
    geom_smooth(method = "lm", se = FALSE, linewidth = 0.9) +
    labs(
      title = sprintf("Attendance vs. Enrollment (r = %.2f)", r_val),
      x = "Enrollment Rate (%)", y = "Attendance Rate (%)"
    ) +
    theme_minimal(base_size = 12)
  print(p_scatter)
}

# 4) Trend lines (unchanged, but use 'collapsed' to avoid duplicates)
p_trend <- ggplot(collapsed, aes(Year, Value, color = factor(..group..))) +
  geom_line(linewidth = 1, na.rm = TRUE, aes(color = after_stat(NULL))) # placeholder; we'll fix color below

# Better: rebuild trend with explicit mapping
p_trend <- ggplot(collapsed, aes(Year, Value, color = Type)) +
  geom_line(linewidth = 1, na.rm = TRUE) +
  geom_point(size = 1.5, alpha = 0.7) +
  facet_wrap(~ Level, ncol = 3) +
  scale_color_manual(values = c("Attendance" = "#FF69B4", "Enrollment" = "#3BBF8F")) +
  labs(title = "Attendance vs Enrollment over Time (USA)", y = "Rate (%)", color = "Series") +
  theme_minimal(base_size = 12)
print(p_trend)

# 5) Gap over time (only where both exist)
if (any(!is.na(gap_df$Gap))) {
  p_gap <- ggplot(gap_df, aes(Year, Gap, color = Level)) +
    geom_hline(yintercept = 0, linetype = "dashed") +
    geom_line(linewidth = 1, na.rm = TRUE) +
    geom_point(size = 1.3, alpha = 0.8) +
    labs(title = "Enrollment – Attendance Gap (percentage points)", y = "Gap (pp)", x = "Year") +
    theme_minimal(base_size = 12)
  print(p_gap)
}

# 6) Summary stats (use collapsed to avoid duplicate inflation)
summary_stats <- collapsed %>%
  group_by(Type, Level) %>%
  summarize(
    n = n(),
    mean_rate = mean(Value, na.rm = TRUE),
    median_rate = median(Value, na.rm = TRUE),
    sd_rate = sd(Value, na.rm = TRUE),
    min_rate = min(Value, na.rm = TRUE),
    max_rate = max(Value, na.rm = TRUE),
    IQR = IQR(Value, na.rm = TRUE),
    .groups = "drop"
  )
readr::write_csv(summary_stats, "tendency_attend.csv")
summary_stats
## # A tibble: 6 × 9
##   Type       Level       n mean_rate median_rate sd_rate min_rate max_rate   IQR
##   <chr>      <chr>   <int>     <dbl>       <dbl>   <dbl>    <dbl>    <dbl> <dbl>
## 1 Attendance High        6      96.4        96.6   0.583     95.3     96.8 0.383
## 2 Attendance Middle      6      97.7        98.1   0.966     95.8     98.3 0.493
## 3 Attendance Primary     6      97.2        97.8   1.32      94.6     98.0 0.805
## 4 Enrollment High       13      94.4        94.9   2.46      90.9     98.8 4.17 
## 5 Enrollment Middle     17      98.8        99.2   1.11      96.3    100.0 1.05 
## 6 Enrollment Primary    21      97.3        98.0   2.85      85.9     99.6 1.24

6. Teachers vs. Compensation — Reshape, Compare, and Quantify

This section examines how teacher workforce size and compensation have evolved in the United States from 2014 to 2021.
Using EdStats_teacher.csv, we identify indicators relating to the number of teachers and the share of educational expenditure allocated to teaching staff compensation.
After filtering out non-teaching staff and qualification percentages, the data are reshaped into tidy long form to create two new analytical variables:

Teacher counts are scaled per 10 000 to make them visually comparable to percentages.
We then:

Reading tip:
Focus on whether the compensation share rises or falls with the number of teachers.
Parallel upward trends may indicate sustained investment in instructional personnel, while divergence could suggest funding reallocation, changing class sizes, or reporting gaps across education levels.

# 1) Read & filter (keep only teaching staff; drop qualification % series)
teacher_raw <- read_csv("EdStats_teacher.csv", show_col_types = FALSE) %>%
  filter(!str_detect(`Indicator name`, regex("non-?teaching staff", ignore_case = TRUE))) %>%
  filter(!str_detect(`Indicator name`, regex("percentage of qualified", ignore_case = TRUE)))

# 2) Long format (auto-detect year columns) + standardized labels
teachers_long0 <- teacher_raw %>%
  pivot_longer(cols = matches("^\\d{4}$"), names_to = "Year", values_to = "Value") %>%
  mutate(
    Year = as.numeric(Year),
    Type = case_when(
      str_detect(`Indicator name`, regex("compensation", ignore_case = TRUE)) ~
        "Compensation % of Total Expenditure",
      str_detect(`Indicator name`, regex("teachers? in|number of teachers", ignore_case = TRUE)) ~
        "Number of Teachers",
      TRUE ~ NA_character_
    ),
    Level = case_when(
      str_detect(`Indicator name`, regex("\\bpre-?primary\\b", ignore_case = TRUE)) ~ "Pre-Primary",
      str_detect(`Indicator name`, regex("\\bprimary\\b", ignore_case = TRUE)) ~ "Primary",
      str_detect(`Indicator name`, regex("lower\\s*secondary", ignore_case = TRUE)) ~ "Lower Secondary",
      str_detect(`Indicator name`, regex("upper\\s*secondary", ignore_case = TRUE)) ~ "Upper Secondary",
      str_detect(`Indicator name`, regex("\\bsecondary\\b", ignore_case = TRUE)) ~ "Secondary (unspecified)",
      str_detect(`Indicator name`, regex("\\btertiary\\b|post-?secondary|higher education", ignore_case = TRUE)) ~ "Tertiary",
      TRUE ~ NA_character_
    )
  ) %>%
  filter(!is.na(Type), !is.na(Level), !is.na(Value))

# 3) Collapse duplicates within (Year, Level, Type) to ensure 1 row per cell
teachers_long <- teachers_long0 %>%
  group_by(Level, Year, Type) %>%
  summarise(Value = mean(Value, na.rm = TRUE), .groups = "drop")

# 4) Scale teacher counts to 10,000s for visual comparability with percentages
teachers_long <- teachers_long %>%
  mutate(Value_scaled = if_else(Type == "Number of Teachers", Value / 10000, Value))

# (Optional) order levels for consistent faceting
lvl_order <- c("Pre-Primary", "Primary", "Lower Secondary", "Upper Secondary",
               "Secondary (unspecified)", "Tertiary")
teachers_long <- teachers_long %>%
  mutate(Level = factor(Level, levels = intersect(lvl_order, unique(Level))))

# ---- A. Time-series trends by level (scaled) ----
p_line_teacher <- ggplot(teachers_long, aes(x = Year, y = Value_scaled, color = Type)) +
  geom_line(linewidth = 1.1, na.rm = TRUE) +
  geom_point(size = 1.7, alpha = 0.8) +
  facet_wrap(~ Level, ncol = 3, scales = "fixed") +
  theme_minimal(base_size = 12) +
  theme(strip.background = element_rect(fill = "grey95", color = NA)) +
  scale_color_manual(
    values = c("Number of Teachers" = "#FF69B4",
               "Compensation % of Total Expenditure" = "#3BBF8F"),
    labels = c("Number of Teachers (per 10,000)", "Compensation (% of total)")
  ) +
  labs(
    title = "Teachers & Compensation Trends by Level (USA)",
    y = "Scaled value (Teachers per 10,000; Compensation %)",
    color = "Series"
  ) +
  scale_x_continuous(breaks = pretty(unique(teachers_long$Year)))
p_line_teacher

# Interactive view (kept optional for HTML)
ggplotly(p_line_teacher)
# ---- B. Coverage diagnostics (counts by Type / Year / Level) ----
p_count_type <- ggplot(teachers_long, aes(x = Type)) +
  geom_bar(fill = "#c562f0") +
  theme_minimal(base_size = 12) +
  labs(title = "Coverage: Count of Observations by Type", x = "Series", y = "Count")
p_count_type

p_count_year <- ggplot(teachers_long, aes(x = factor(Year))) +
  geom_bar(fill = "#c562f0") +
  theme_minimal(base_size = 12) +
  labs(title = "Coverage: Observations per Year (all levels)", x = "Year", y = "Count")
p_count_year

p_count_level <- ggplot(teachers_long, aes(x = Level, fill = Type)) +
  geom_bar(position = "dodge") +
  theme_minimal(base_size = 12) +
  labs(title = "Coverage: Observations by Level and Type", x = "Level", y = "Count", fill = "Series")
p_count_level

# ---- C. Correlation between staffing and compensation (where both exist) ----
# Wide table on the *scaled* values so axes are comparable
wide_tc <- teachers_long %>%
  select(Level, Year, Type, Value_scaled) %>%
  pivot_wider(names_from = Type, values_from = Value_scaled)

# overall (pooled) correlation with guard
corr_tbl <- tibble::tibble()

valid_overall <- wide_tc %>%
  mutate(valid = complete.cases(`Number of Teachers`, `Compensation % of Total Expenditure`)) %>%
  filter(valid)

if (nrow(valid_overall) >= 3) {
  overall_r <- cor(valid_overall$`Number of Teachers`,
                   valid_overall$`Compensation % of Total Expenditure`,
                   use = "complete.obs", method = "pearson")
  corr_tbl <- bind_rows(corr_tbl, tibble(Level = "All levels (pooled)", r = overall_r))
}

# by-level correlations (no cur_data(); compute within each group)
by_level_r <- wide_tc %>%
  group_by(Level) %>%
  summarise(
    n_pairs = sum(complete.cases(`Number of Teachers`, `Compensation % of Total Expenditure`)),
    r = ifelse(
      n_pairs >= 3,
      cor(`Number of Teachers`, `Compensation % of Total Expenditure`,
          use = "complete.obs", method = "pearson"),
      NA_real_
    ),
    .groups = "drop"
  ) %>%
  filter(!is.na(r))

corr_tbl <- bind_rows(corr_tbl, by_level_r %>% select(Level, r))
corr_tbl
## # A tibble: 6 × 2
##   Level                        r
##   <chr>                    <dbl>
## 1 All levels (pooled)     -0.313
## 2 Pre-Primary             -0.571
## 3 Primary                  0.551
## 4 Lower Secondary         -0.830
## 5 Upper Secondary         -0.862
## 6 Secondary (unspecified) -0.334
# ---- D. Summary statistics (per Type × Level) ----
summary_stats_teachers <- teachers_long %>%
  group_by(Type, Level) %>%
  summarise(
    n = n(),
    mean_value = mean(Value, na.rm = TRUE),
    median_value = median(Value, na.rm = TRUE),
    sd_value = sd(Value, na.rm = TRUE),
    min_value = min(Value, na.rm = TRUE),
    max_value = max(Value, na.rm = TRUE),
    IQR = IQR(Value, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(Type, Level)

write_csv(summary_stats_teachers, "tendency_teachers.csv")
summary_stats_teachers
## # A tibble: 12 × 9
##    Type         Level     n mean_value median_value sd_value min_value max_value
##    <chr>        <fct> <int>      <dbl>        <dbl>    <dbl>     <dbl>     <dbl>
##  1 Compensatio… Pre-…     6       49.0         49.0  6.83e-1      48.1      49.8
##  2 Compensatio… Prim…     6       49.0         49.0  6.83e-1      48.1      49.8
##  3 Compensatio… Lowe…     6       49.0         49.0  6.83e-1      48.1      49.8
##  4 Compensatio… Uppe…     6       49.0         49.0  6.83e-1      48.1      49.8
##  5 Compensatio… Seco…     6       46.8         46.7  9.30e-1      46.1      48.7
##  6 Compensatio… Tert…     6       28.5         28.3  7.90e-1      27.6      30.0
##  7 Number of T… Pre-…    12   531226.      608720.   1.40e+5  302255    651997. 
##  8 Number of T… Prim…    17  1604827.     1687937    1.49e+5 1414000   1774348. 
##  9 Number of T… Lowe…    14   810580.      858595.   8.81e+4  670172    894725. 
## 10 Number of T… Uppe…    14   774670.      813548.   7.75e+4  650603    848440. 
## 11 Number of T… Seco…    15  1501112.     1661375    2.95e+5  784414.  1737206. 
## 12 Number of T… Tert…    22   853184.      747500    3.14e+5  574000   1581424  
## # ℹ 1 more variable: IQR <dbl>

7. Discussion

Interim Takeaways

  • Enrollment vs. Attendance: Attendance rates consistently trail enrollment rates at all levels, though the gap varies by level and year. This may point to structural barriers beyond access.
  • Teachers vs. Compensation: Preliminary trends suggest whether compensation shares grow with staffing. Divergences could signal reallocations or coverage differences.

These observations are preliminary and will guide further analysis.

Reproducibility Notes
All intermediate datasets and figures are written to disk (EdStats_USA.csv, EdStats_attend.csv, EdStats_teacher.csv, and exported PNGs).
This allows others to review and re-run later steps without repeating earlier processing.
Consider adding sessionInfo() at the end to document package versions.


8. Works Cited

reference string

reference string

reference string